Rhys Hewer
#load libraries
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
library(RColorBrewer)
library(corrplot)
library(caret)
library(tidyr)
library(kableExtra)
library(parallel)
library(doParallel)
library(rattle)
library(rpart)
#read in data
setwd("C:/Users/rhysh/Google Drive/Data Science/Ubiqum/Project 2/Task 2")
incomplete <- read.csv("SurveyIncomplete.csv")
origdata <- read_xlsx("Survey_Key_and_Complete_Responses_excel.xlsx", sheet = 2)
data <- origdata
As we are working across two spreadsheets, it is important to check whether they are structured alike or whether manipulation will be required to give them the same features.
#Check structure of data to see if alike
names(incomplete) == names(data)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The features match between the spreadsheets so no data manipulation is needed in that respect.
incomplete %>% sample_n(5)
## salary age elevel car zipcode credit brand
## 1950 72316.5 68 0 10 3 45951.63 0
## 1471 145072.1 62 3 6 0 303484.61 0
## 4335 51420.7 72 1 5 5 98634.11 0
## 4272 114306.0 22 4 13 5 383973.00 0
## 4592 102539.3 68 2 3 1 82246.30 0
data %>% sample_n(5)
## # A tibble: 5 x 7
## salary age elevel car zipcode credit brand
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 107543. 72 1 10 0 404918. 1
## 2 59258. 44 1 1 6 115942. 1
## 3 133413. 42 1 15 6 12841. 1
## 4 66381. 75 0 20 2 16870. 0
## 5 142847. 50 3 1 0 498545. 1
A quick look at a sample from both spreadsheets shows that they are very similar in composition.
str(incomplete)
## 'data.frame': 5000 obs. of 7 variables:
## $ salary : num 110500 140894 119160 20000 93956 ...
## $ age : int 54 44 49 56 59 71 32 33 32 58 ...
## $ elevel : int 3 4 2 0 1 2 1 4 1 2 ...
## $ car : int 15 20 1 9 15 7 17 17 19 8 ...
## $ zipcode: int 4 7 3 1 1 2 1 0 2 4 ...
## $ credit : num 354724 395015 122025 99630 458680 ...
## $ brand : int 0 0 0 0 0 0 0 0 0 0 ...
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 10000 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : num 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : num 0 1 0 3 3 3 4 3 4 1 ...
## $ car : num 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: num 4 6 2 5 4 3 5 0 0 4 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : num 0 1 0 1 0 1 1 1 0 1 ...
Looking at the structure shows that a few changes are needed to the data types.
Overall, however, the datasets are sufficiently similar. I will split the data into training, test and final prediction sets: the training and testing sets will come from the complete responses, and the final predictions will be made on the incomplete responses.
The exploratory data analysis will be performed on the complete data, but any data transformation applied to the training/testing sets will also need to be applied to the final prediction data prior to modelling.
#check for NAs
data %>% is.na() %>% sum()
## [1] 0
incomplete %>% is.na() %>% sum()
## [1] 0
There are no missing values in either data set.
#data types
data$elevel <- data$elevel %>% as.factor()
data$car <- data$car %>% as.factor()
data$zipcode <- data$zipcode %>% as.factor()
data$brand <- data$brand %>% as.factor()
Education level, car owned, zip code and brand preference are converted from numeric to factors.
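The same conversions will eventually be needed on the final prediction set, since the model will expect matching column types. A minimal sketch, assuming the levels observed in `incomplete` match those in the complete data:

```r
# Mirror the type conversions on the final prediction set so its
# structure matches the training data at prediction time
incomplete$elevel  <- as.factor(incomplete$elevel)
incomplete$car     <- as.factor(incomplete$car)
incomplete$zipcode <- as.factor(incomplete$zipcode)
incomplete$brand   <- as.factor(incomplete$brand)
```

If the two files could disagree on observed levels, a stricter variant is `factor(incomplete$elevel, levels = levels(data$elevel))`, which pins the levels to those of the complete data.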
##check for outliers
numericVars <- Filter(is.numeric, data)
numericVars %>% sapply(function(x) boxplot(x, plot = FALSE)$out) %>% str()
## List of 3
## $ salary: num(0)
## $ age : num(0)
## $ credit: num(0)
There are no outliers.
#EDA
##Key outcome is brand preference; begin by exploring this variable.
g6 <- ggplot(data, aes(brand, fill = brand)) +
geom_bar() +
theme_bw() +
scale_fill_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Frequency") +
ggtitle("Brand Preference Frequencies")
g6
##review other histograms for skewness
histData <- origdata %>% select(-brand)
g8 <- ggplot(gather(histData), aes(value)) +
geom_histogram(bins = 10, fill = "#D95F02", colour = "white") +
theme_bw() +
facet_wrap(~key, scales = 'free_x') +
xlab("Value") +
ylab("Count") +
ggtitle("Histograms of Numeric Variables")
g8
## Plot Brand choice v other variables
g1 <- ggplot(data, aes(brand, salary, fill = brand)) +
geom_violin() +
theme_bw() +
scale_fill_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Salary ($)") +
ggtitle("Brand Preference v Salary")
g1 <- ggplotly(g1)
g1
g2 <- ggplot(data, aes(brand, age, fill = brand)) +
geom_violin() +
theme_bw() +
scale_fill_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Age") +
ggtitle("Brand Preference v Age")
g2 <- ggplotly(g2)
g2
g3 <- ggplot(data, aes(brand, elevel, colour = brand)) +
geom_count() +
theme_bw() +
scale_color_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Education Level") +
ggtitle("Brand Preference v Education Level")
g3
g4 <- ggplot(data, aes(brand, car, colour = brand)) +
geom_count() +
theme_bw() +
scale_color_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Car") +
ggtitle("Brand Preference v Car")
g4
g5 <- ggplot(data, aes(brand, zipcode, colour = brand)) +
geom_count() +
theme_bw() +
scale_color_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Zip Code") +
ggtitle("Brand Preference v Zip Code")
g5
g7 <- ggplot(data, aes(brand, credit, fill = brand)) +
geom_violin() +
theme_bw() +
scale_fill_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Credit") +
ggtitle("Brand Preference v Credit")
g7 <- ggplotly(g7)
g7
There is a general preference for brand 1, and salary appears to be the key feature.
Age and credit seem to have very limited influence on brand preference. Whilst there are patterns of preference within education level, car and zip code, salary appears to have the most striking impact.
Respondents with salaries between 45k and 100k seem to prefer brand 0; those outside this range seem to prefer brand 1.
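This reading of the violin plot can be sanity-checked with a quick cross-tabulation. A sketch, where the 45k/100k breakpoints are eyeballed from the plot rather than fitted:

```r
# Tabulate brand preference within rough salary bands (breakpoints are
# illustrative, taken by eye from the violin plot)
data %>%
  mutate(salaryBand = cut(salary,
                          breaks = c(-Inf, 45000, 100000, Inf),
                          labels = c("<45k", "45k-100k", ">100k"))) %>%
  count(salaryBand, brand)
```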
Using a decision tree, we can review the initial hypothesis and gather information on which to base feature selection decisions.
featureDT <- rpart(brand ~ ., data = data)
featureDT$variable.importance
## age salary car zipcode credit elevel
## 2036.06398 1421.30552 78.59978 42.70644 21.20404 16.06770
fancyRpartPlot(featureDT)
The decision tree is formed exclusively of age and salary, suggesting these are the key features. This is reinforced by the variable importance information from within the model.
g12 <- ggplot(data, aes(salary, age, colour = brand)) +
geom_point(show.legend = FALSE) +
facet_grid(brand ~ .) +
theme_bw() +
scale_color_brewer(palette="Dark2") +
xlab("Salary") +
ylab("Age") +
ggtitle("Salary v Age by Brand Preference")
g12
We see that the interaction of age and salary does play a role in brand preference. There are clearly defined blocks of preference depending on age and salary: for example, respondents aged 60-80 earning between 25k and 80k show a distinct preference for brand 0.
I have updated the initial hypothesis to now consider that age and salary are the key features in determining brand preference.
##Review correlation matrix
corrMatrix <- origdata %>% cor()
corrMatrix %>% corrplot.mixed()
Reviewing the correlation plot, we see very little correlation between the features, so collinearity is not an issue that needs to be addressed.
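caret offers a programmatic check of the same conclusion. `findCorrelation` returns the columns that would be dropped at a given correlation cutoff; given the plot above, it should return none here (the 0.75 cutoff is an illustrative choice):

```r
# findCorrelation flags columns whose pairwise correlation exceeds the
# cutoff; an empty result confirms collinearity needs no treatment
highCorr <- findCorrelation(corrMatrix, cutoff = 0.75)
colnames(corrMatrix)[highCorr]
```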
#near zero variance
nzv <- data %>% nearZeroVar(saveMetrics = TRUE)
nzv
## freqRatio percentUnique zeroVar nzv
## salary 1.112069 97.53 FALSE FALSE
## age 1.178744 0.61 FALSE FALSE
## elevel 1.035946 0.05 FALSE FALSE
## car 1.026415 0.20 FALSE FALSE
## zipcode 1.020105 0.09 FALSE FALSE
## credit 1.057377 97.51 FALSE FALSE
## brand 1.643405 0.02 FALSE FALSE
There are no features with near zero variance.
The Random Forest algorithm has built-in feature selection, so no explicit feature selection is required for it.
For KNN I will take both a fully-featured dataset and a concentrated dataset (brand, salary, age) into the modelling phase and compare results.
#Creating Testing/Training sets
set.seed(111)
trainIndex <- createDataPartition(data$brand, p = 0.75, list = FALSE) # stratify the split on the outcome variable
training <- data[ trainIndex,]
testing <- data[-trainIndex,]
concData <- data %>% select(brand, age, salary)
training.conc <- concData[ trainIndex,]
testing.conc <- concData[-trainIndex,]
#set up parallel processing (requires parallel and doParallel libraries)
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
#Cross Validation 10 fold
fitControl<- trainControl(method = "cv", number = 10, savePredictions = TRUE, allowParallel = TRUE)
Test and training sets are created with all features. An additional training/testing pair is created with just brand, age and salary. 10-fold cross-validation is used and parallel processing is leveraged.
# Deploy KNN model
model.KNN.brand <- train(brand ~ ., data = training,
method = "knn", trControl = fitControl)
model.KNN.brand
## k-Nearest Neighbors
##
## 114 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 102, 103, 103, 102, 103, 102, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7386364 0.2455904
## 7 0.7196970 0.1720068
## 9 0.7015152 0.1015749
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
# Deploy KNN model with concentrated feature set
model.KNN.brand.conc <- train(brand ~ ., data = training.conc,
method = "knn", trControl = fitControl)
model.KNN.brand.conc
## k-Nearest Neighbors
##
## 114 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 103, 103, 103, 103, 102, 103, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7346154 0.2830504
## 7 0.7451049 0.3028544
## 9 0.7298368 0.2675434
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
# Deploy KNN model with scaled features
model.KNN.brand.scale <- train(brand ~ ., data = training,
method = "knn", trControl = fitControl,
preProcess = c("center", "scale"))
model.KNN.brand.scale
## k-Nearest Neighbors
##
## 114 samples
## 6 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 103, 103, 102, 102, 103, 103, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6674242 -0.03808089
## 7 0.6750000 -0.05051582
## 9 0.6931818 -0.04586466
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
# Deploy KNN model with scaled features and concentrated feature set
model.KNN.brand.conc.scale <- train(brand ~ ., data = training.conc,
method = "knn", trControl = fitControl,
preProcess = c("center", "scale"))
model.KNN.brand.conc.scale
## k-Nearest Neighbors
##
## 114 samples
## 2 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 103, 103, 103, 103, 102, 101, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8679487 0.6380495
## 7 0.8581002 0.6001548
## 9 0.8246503 0.4795492
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
#tune best KNN model further
tGridKNN <- expand.grid(k = c(1,2,3,4,5,6))
finalKNN <- train(brand ~ ., data = training.conc,
method = "knn", trControl = fitControl,
preProcess = c("center", "scale"),
tuneGrid = tGridKNN)
finalKNN
## k-Nearest Neighbors
##
## 114 samples
## 2 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 101, 103, 103, 103, 103, 102, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9113636 0.7810864
## 2 0.9030303 0.7682660
## 3 0.9303030 0.8325011
## 4 0.8688228 0.6379981
## 5 0.9036713 0.7356803
## 6 0.8597319 0.5987915
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
By tuning the model I was able to increase the accuracy on the training set to 93% and the Kappa to 0.83. This raises concerns about overfitting.
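One way to probe that overfitting concern before moving to the test set is to repeat the cross-validation, so the chosen k is averaged over more resamples. A sketch using the same tuning grid (the choice of 5 repeats is arbitrary):

```r
# Repeated CV gives a less noisy estimate for small k, at the cost of
# extra training time
fitControlRep <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                              savePredictions = TRUE, allowParallel = TRUE)
finalKNN.rep <- train(brand ~ ., data = training.conc,
                      method = "knn", trControl = fitControlRep,
                      preProcess = c("center", "scale"),
                      tuneGrid = tGridKNN)
```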
#Predictions on the test set (model = model name, testing = test set)
predictions.KNN <- predict(finalKNN, testing)
testing$predictions.KNN <- predictions.KNN
#Confusion matrix
confMatrixKNN <- confusionMatrix(data = testing$predictions.KNN, reference = testing$brand)
confMatrixKNN
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3056 695
## 1 588 5547
##
## Accuracy : 0.8702
## 95% CI : (0.8634, 0.8768)
## No Information Rate : 0.6314
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7229
## Mcnemar's Test P-Value : 0.003083
##
## Sensitivity : 0.8386
## Specificity : 0.8887
## Pos Pred Value : 0.8147
## Neg Pred Value : 0.9042
## Prevalence : 0.3686
## Detection Rate : 0.3091
## Detection Prevalence : 0.3794
## Balanced Accuracy : 0.8636
##
## 'Positive' Class : 0
##
fourfoldplot(confMatrixKNN$table, conf.level = 0, margin = 1, main = "Confusion Matrix KNN")
The accuracy of the model on the test set was 87% with a Kappa of 0.72. These are satisfactory outcomes. Reviewing the confusion matrix, we see that the model does a fairly good job of classifying brand preference, with a slight tendency to overpredict brand 0.
# Deploy RF model
tGridRF <- expand.grid(mtry = c(24, 30, 35)) # the formula interface dummy-encodes the factors, so ~34 predictors are available
model.RF.brand.tune <- train(brand ~ ., data = training,
method = "rf", trControl = fitControl,
tuneGrid = tGridRF)
model.RF.brand.tune
## Random Forest
##
## 114 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 102, 103, 102, 102, 103, 103, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 24 0.7734848 0.3644853
## 30 0.7833333 0.4161136
## 35 0.7833333 0.4161136
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 30.
## Deploy RF model on concentrated feature set
tGridRF.conc <- expand.grid(mtry = c(1,2,3,4,5,6))
model.RF.brand.conc <- train(brand ~ ., data = training.conc,
method = "rf", trControl = fitControl,
tuneGrid = tGridRF.conc)
model.RF.brand.conc
## Random Forest
##
## 114 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 103, 103, 103, 103, 103, 103, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.9035548 0.7498966
## 2 0.8966200 0.7457665
## 3 0.8882867 0.7282665
## 4 0.8882867 0.7282665
## 5 0.8882867 0.7282665
## 6 0.8959790 0.7472085
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
varImp(model.RF.brand.conc)
## rf variable importance
##
## Overall
## salary 100
## age 0
The Accuracy of the best Random Forest model was 90% with a Kappa of 0.75. These seem like acceptable values although there is the possibility of overfitting.
## Test Random Forest
predictions.RF <- predict(model.RF.brand.conc, testing)
testing$predictions.RF <- predictions.RF
#Confusion matrix
confMatrixRF <- confusionMatrix(data = testing$predictions.RF, reference = testing$brand)
confMatrixRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2845 906
## 1 519 5616
##
## Accuracy : 0.8559
## 95% CI : (0.8488, 0.8627)
## No Information Rate : 0.6597
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6877
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8457
## Specificity : 0.8611
## Pos Pred Value : 0.7585
## Neg Pred Value : 0.9154
## Prevalence : 0.3403
## Detection Rate : 0.2878
## Detection Prevalence : 0.3794
## Balanced Accuracy : 0.8534
##
## 'Positive' Class : 0
##
fourfoldplot(confMatrixRF$table, conf.level = 0, margin = 1, main = "Confusion Matrix RF")
Random Forest gives an Accuracy of 86% and a kappa of 0.69.
##GBM
model.GBM.brand <- train(brand ~ ., data = training,
method = "gbm", trControl = fitControl,
verbose = FALSE)
model.GBM.brand
## Stochastic Gradient Boosting
##
## 114 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 103, 103, 103, 102, 103, 102, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7348485 0.1654335
## 1 100 0.7803030 0.3501018
## 1 150 0.7530303 0.2748936
## 2 50 0.8060606 0.4094712
## 2 100 0.8060606 0.4255128
## 2 150 0.8151515 0.4465689
## 3 50 0.8234848 0.4730280
## 3 100 0.8666667 0.6138182
## 3 150 0.8681818 0.6529269
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
#concentrated feature set
tGridGBM <- expand.grid(n.trees = c(150,200,250),
interaction.depth = c(3,4), shrinkage = 0.1,
n.minobsinnode = c(5,10))
model.GBM.brand.conc <- train(brand ~ ., data = training.conc,
method = "gbm", trControl = fitControl,
verbose = FALSE, tuneGrid = tGridGBM)
model.GBM.brand.conc
## Stochastic Gradient Boosting
##
## 114 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 102, 103, 103, 102, 103, 103, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.minobsinnode n.trees Accuracy Kappa
## 3 5 150 0.8424242 0.5878979
## 3 5 200 0.8424242 0.5878979
## 3 5 250 0.8515152 0.6134793
## 3 10 150 0.8431818 0.5756526
## 3 10 200 0.8522727 0.6012340
## 3 10 250 0.8606061 0.6317896
## 4 5 150 0.8424242 0.5878979
## 4 5 200 0.8507576 0.6075058
## 4 5 250 0.8507576 0.6075058
## 4 10 150 0.8613636 0.6509819
## 4 10 200 0.8696970 0.6587734
## 4 10 250 0.8439394 0.5910516
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200,
## interaction.depth = 4, shrinkage = 0.1 and n.minobsinnode = 10.
The best result for the tuned GBM model on the training set was an accuracy of 87% and a Kappa of 0.66.
## Test GBM
predictions.GBM <- predict(model.GBM.brand.conc, testing)
testing$predictions.GBM <- predictions.GBM
#Confusion matrix
confMatrixGBM <- confusionMatrix(data = testing$predictions.GBM, reference = testing$brand)
confMatrixGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3112 639
## 1 701 5434
##
## Accuracy : 0.8645
## 95% CI : (0.8575, 0.8711)
## No Information Rate : 0.6143
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7131
## Mcnemar's Test P-Value : 0.09564
##
## Sensitivity : 0.8162
## Specificity : 0.8948
## Pos Pred Value : 0.8296
## Neg Pred Value : 0.8857
## Prevalence : 0.3857
## Detection Rate : 0.3148
## Detection Prevalence : 0.3794
## Balanced Accuracy : 0.8555
##
## 'Positive' Class : 0
##
fourfoldplot(confMatrixGBM$table, conf.level = 0, margin = 1, main = "Confusion Matrix GBM")
For the GBM applied to the test set, the accuracy was 86% with a Kappa of 0.71.
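The remaining steps are to score the incomplete survey with the preferred model and combine the two data sets, then release the parallel workers. A sketch, illustratively using the tuned KNN (the model choice here is mine, not settled above) and assuming `incomplete` has received the same factor conversions as `data`:

```r
# Score the incomplete survey with the chosen model and stack the two
# data sets into a single combined survey
incomplete$brand <- predict(finalKNN, incomplete)
combined <- rbind(data.frame(data, source = "complete"),
                  data.frame(incomplete, source = "predicted"))
# Release the parallel workers now that all training is finished
stopCluster(cluster)
registerDoSEQ()
```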